Abstract- Internet worms are identified as one of the serious security threats caused by their anomalous
behaviors. Worms in the network cause many cyber security threats such as Distributed Denial of Service,
illegal network traffic, spreading spam and stealing personal user information. In this paper, the network
traffic is monitored and its abnormal behavior on the Internet is detected and classified based on
attribute payload using the C4.5 algorithm with Pearson's Correlation Coefficient. The proposed approach
detects the Internet worm activities based on their traffic behavior using a decision tree algorithm. The
categorization of continuous attributes is performed by the C4.5 algorithm to test distinct values and the
method reduces the processing time of the large input. The experimental results obtained identify the
unknown worms with improved accuracy in detecting malicious flows. The results obtained show improved
precision value, recall value and better accuracy in the detection of Internet worms.



Keywords: Traffic flow, Attribute vector, Information Gain, Pearson's Correlation Coefficient.

1. Introduction





In the Internet world today, worms cause billion-dollar damage every year throughout the world. Internet
worms are malicious codes which propagate automatically by themselves. Moreover, they do not need
any human intervention for their propagation in the vulnerable hosts. The worms spreading in the network
affect computers by stealing their confidential information, deleting files, reducing the speed of
network functioning and creating a Distributed Denial of Service (DDoS); with the infected hosts, they also
further damage other hosts by launching attacks [1] [5] [B] [12].







Using an epidemic spreading style, the Nimda and Code Red worms caused immense damage in the Internet
world during 2001. In 2003, the Slammer worm scanned more than 55 million machines within 3 minutes and
damaged nearly 90% of vulnerable hosts in the network within 10 minutes [8]. In 2004, the Witty worm
affected more than 12,000 vulnerable hosts within minutes and in 2007, the Storm worm damaged millions of
computers [2]. The Conficker worm in 2008 infected the cloud network and controlled 6.4 million vulnerable
hosts globally in 230 countries [9] [U]. A top research priority in information security is to stop the
propagation of worms on the Internet.



Internet worms creating illegal traffic behaviors are one of the existing challenges in the network. These
intrusions in networks are carried out by worms in the form of payload replication and malicious traffic.
Packet payload analysis will not provide better network security if the contents are encrypted. To
overcome the above limitation of payload monitoring, traffic should be scanned and detected. The existing
approach handles large numbers of missing values poorly and sorts numeric attribute values only once [3].
The proposed approach achieves better detection accuracy by reducing the missing values
and grouping the continuous attributes. C4.5 with Pearson's Correlation Coefficient minimizes the data file
input space and communication overhead.

The organization of the paper is as follows: Section 2 reviews the related works on Internet worm
detection. Section 3 describes the proposed methods for Internet worm detection based on attribute
payload. Section 4 describes the experimental evaluation results for detection. Finally, Section 5
concludes the paper.

2. Related Works 



Internet worms infect the network through illegal traffic flow. Monitoring and detecting the malicious
traffic behavior provides better and faster communication. Rather than inspecting payloads, traffic flow
monitoring observes the network traffic and exposes the illegal traffic of Internet worms. Various techniques
proposed for Internet worm detection are listed below.






Chao Chen et al. [2] analyzed the divide-conquer-scanning (DCS) worm, an effective scanning technique that
future Internet epidemics could use. This technique is faster and stealthier than the
random-scanning worm. The authors also described two defense mechanisms: infected-host
removal and active honeynets. Deguang Kong et al. [4] introduced a novel method for detecting
network-based worms. It first generates the signatures automatically using the Semantics Aware Statistical
algorithm, which removes the non-critical bytes and is combined with a hidden Markov model to
automatically generate worm signatures.

Jun-qun et al. [6] analyzed how to find the vulnerable hosts. Here the authors implemented the gradual hybrid
anti-worm. This approach is a combination of active and passive anti-worms. The active
anti-worm detects the vulnerable hosts on the network and patches them up. The listening process is
handled by the passive anti-worm, which attacks the worm after the host has been patched. Qian
Wang et al. [7] proposed an approach to analyze the Internet worm infection family tree, named the
worm tree. Through mathematical analysis, the approach captures the key characteristics of Internet worm
infection and applies them to bot detection.




Yu Yao et al. [B] implemented an approach based on time delay to reduce network worms and also
decrease the economic loss rate. In this paper, a critical value is derived. If the time delay is greater than
the critical value, then the worm will be eliminated from the network. Zaki et al. [14] introduced WSRMAS,
an anti-worm system. This approach effectively reduces the spreading of the infected worms in network
routers and consists of a multi-agent system that can limit or even stop the worm spreading.

Table 1 lists the different techniques proposed and implemented by various authors and the parameters
used for their experimentation and evaluation. The observations show the achievements of the methods and
are listed in Table 1.



Table 1. Review of Literature for Internet Worm Detection

Year | Author | Technique(s) Used | Parameters Used | Observations
-----|--------|-------------------|-----------------|-------------
2010 | Chao Chen et al. | Divide-conquer-scanning worms | Scanning Rate, Scanning Probability, and Scanning Space | Analyze the characteristics of DCS worms and potential countermeasures.
201) | Zaki and Hamouda | Multi-agent system | Number of machines, Detection time, Worm spreading interval, Anti-worm spreading interval, Anti-worm movement interval | Centralized planning capability improves the system effectiveness by decreasing the percentage of infected machines by about 40%.
2011 | Deguang Kong et al. | Semantics Aware Statistical algorithm | False Positive, False Negative | Accurately detects worms with concise signatures. Fast in online detection speed, better in noise tolerance.
2011 | Jun-qun et al. | Gradual Hybrid Anti-Worm System | Transformation threshold rate | Detects and attacks the worm using active and passive anti-worms, then patches up.
2011 | Yu Yao et al. | Time delay in quarantine | TO | Network worm is detected and eliminated using time delay, and the window size is decreased.
2012 | Qian Wang et al. | Probabilistic Modeling and Sequential Growth Model | Geometric Distribution with parameter 0.5 | For forensic analysis, bots from bot assessment are exposed using the worm tree.



From Table 1 above, different methods have been proposed to detect the Internet worms infecting
the network. From the observations, it is found that they detect through monitoring payload and traffic
misbehaviors. Payload detection fails to detect worms when they are encrypted. Monitoring traffic
behavior detects worms only after their spread. To overcome the above limitations, the proposed approach
detects the Internet worms by monitoring the traffic flow collection.

3. Proposed Methodology 

The proposed approach finds the malicious traffic based on the characteristics of network flow using the
improved C4.5 algorithm. To classify the Internet worms, TCP and UDP flows are examined: they are split
into time windows and an attribute vector is extracted. Based on the attribute vectors, malicious and
non-malicious traffic is detected and classified. The existing method, REPTree, is a decision tree learner that uses
information gain as the splitting criterion. Its limitation is that the numeric attribute values can be sorted once
only. C4.5 with Pearson's Correlation Coefficient gives an efficient classification between the malicious and
non-malicious network traffic based on their flow characteristics. Figure 1 below shows the complete
process of detecting Internet worms through their traffic flows.



[Figure 1 flowchart boxes: Packet Input; Monitor Traffic Flow; Apply Entropy for each Attribute; Apply Entropy Splitting Criterion; Calculate Information Gain and Gain Ratio]

Figure 1 Proposed flow of the overall process

The steps followed for monitoring and detecting Internet worms based on the network flow characteristics
are shown in Table 2 below.



Table 2. Steps Proposed for Malicious Traffic Flow Detection 



Step 1: Create an initial node and calculate the Attribute Vector for the incoming flow.
Step 2: Apply entropy, Information Gain and Gain Ratio for each attribute.
Step 3: Select the highest Information Gain for attribute A, for the best splitting
criterion.
Step 4: Consider the best splitting criterion and partition the flows.
Step 5: If the linear relationship value exceeds the threshold, mark the flow as a Malicious Flow.
Figure 1 and Table 2 above show the procedure of the proposed method and the steps involved in detecting
the Internet worms based on their illegal traffic flow.

A. Selection of Attribute vector 



An attribute is a numeric value that summarizes the collection of flows gathered in a given time window T.
The attributes consist of the source and destination IP addresses, the source and destination ports of the flow,
the average time for the packets and the average length of packets exchanged in the time interval, chosen for
effective detection of Internet worms.



An attribute vector is the collection of attributes that captures the characteristics of an individual flow for
a specified time interval. The attribute vector is measured by comparing the total number of flows made by
a particular single address with the total flow count made in some limited time period. The techniques
proposed are the decision tree algorithm C4.5 with the Pearson correlation coefficient.
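As a rough illustration of how such an attribute vector might be computed, the following Python sketch groups flows into time windows and derives, for each source address, its share of the flows in the window together with the average packet length and the average time per packet. The `Flow` record, its field names and the three derived attribute names are assumptions made for this example; the paper does not specify a data layout.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Flow:
    # Hypothetical flow record; the field names are illustrative only.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    start_time: float   # seconds since the start of the capture
    packet_count: int
    total_bytes: int
    duration: float     # seconds


def attribute_vectors(flows, window):
    """Return one attribute vector per (window index, source address).

    `window` is the length of the time window T in seconds.
    """
    flows_per_window = defaultdict(int)
    groups = defaultdict(list)
    for f in flows:
        w = int(f.start_time // window)
        flows_per_window[w] += 1
        groups[(w, f.src_ip)].append(f)

    vectors = {}
    for (w, src), fs in groups.items():
        packets = sum(f.packet_count for f in fs)
        vectors[(w, src)] = {
            # Flows made by this address relative to all flows in the window.
            "flow_share": len(fs) / flows_per_window[w],
            "avg_packet_length": sum(f.total_bytes for f in fs) / packets if packets else 0.0,
            "avg_time_per_packet": sum(f.duration for f in fs) / packets if packets else 0.0,
        }
    return vectors
```

A source that opens most of the flows in a window gets a `flow_share` close to 1, which is the kind of scanning behavior the attribute vector is meant to expose.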



B. C4.5 Algorithm 







C4.5 is a decision tree based algorithm that first grows a big tree over the attribute values and then
finalizes the decision rules using a pruning method. It has features such as handling missing values,
categorization of continuous attributes, pruning of decision trees and rule derivation.
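The categorization of continuous attributes can be illustrated with a short Python sketch. It searches for the binary threshold on one numeric attribute that gives the highest information gain. The function names, the use of midpoints between adjacent values as candidate thresholds and the base-2 logarithm are assumptions for this example; C4.5 implementations differ in these details.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split `value <= threshold`
    with the highest information gain, or (None, 0.0) if no split helps."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    base = entropy(labels)
    n = len(pairs)
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best
```

For example, if low average packet lengths all belong to benign flows and high ones to worm flows, the threshold found lies between the two groups and the continuous attribute is reduced to two categories.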

The most significant attributes are selected by considering all the samples, in which root nodes are
considered as the top nodes of the tree. The subsequent nodes, which are termed branch nodes, receive
the sample information. The decision is made when the path terminates in a leaf node. The path from the root
node to a leaf node is defined by several nodes, from which rules are generated.




Some limitations of the C4.5 algorithm are empty branches, insignificant branches and overfitting. Empty
branches make the tree bigger and more complex. Insignificant branches reduce the usability of the decision
tree and lead to overfitting. Overfitted branches pick up data with uncommon characteristics.

To overcome these limitations and detect Internet worms, the steps considered for constructing the C4.5
tree are listed in Table 4 below.

Table 4. Steps to Detect using C4.5 Algorithm

Step 1: If all cases are of the same class, the tree is a leaf and is returned
labelled with this class.
Step 2: For each attribute, calculate the potential information provided by a test
on the attribute.
Step 3: Also calculate the gain in information that results from a test on the
attribute.
Step 4: Find the best attribute to branch on, depending on the current selection
criterion.


C. Counting Gain 



Entropy measures the disorder of the data. It is defined as

$$ \mathrm{Entropy}(y) = -\sum_{j=1}^{n} \frac{|y_j|}{|y|} \log \frac{|y_j|}{|y|} \tag{1} $$

iterating over all possible values of y. The conditional entropy is

$$ \mathrm{Entropy}(j|y) = \frac{|y_j|}{|y|} \log \frac{|y_j|}{|y|} \tag{2} $$

The gain is defined as

$$ \mathrm{Gain}(y, j) = \mathrm{Entropy}(y) - \mathrm{Entropy}(j|y) \tag{3} $$

The goal is to maximize the gain, dividing by the overall entropy due to splitting the argument y by value j.
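The entropy, gain and gain ratio calculations can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the logarithm is taken to base 2, the conditional entropy is aggregated as the usual weighted sum over the values of the attribute, and the gain ratio divides the gain by the entropy of the attribute itself (the split information). The function names are illustrative and not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Equation (1): -sum(p * log2(p)) over the class proportions p."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    n = len(labels)
    subsets = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        subsets[value].append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def gain_ratio(attribute_values, labels):
    """Information gain divided by the split information of the attribute."""
    split_info = entropy(attribute_values)
    if split_info <= 0:
        return 0.0  # the attribute has a single value and cannot split the data
    return information_gain(attribute_values, labels) / split_info
```

An attribute whose values separate malicious from non-malicious flows perfectly yields a gain equal to the entropy of the labels, while an attribute that takes a single value yields a gain ratio of 0.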
Equation 3 above shows that entropy has some limitations. They are listed below.

If a large number of distinct values are used by both continuous and discrete attributes, then it provides
a poor result.

There is no particular technique for predicting the information gain; moreover, the information gain is generated
only after the attribute values are generated. The system gives lower performance and accuracy because of the
mismatch of the attribute values.

The system fails when the number of attributes is much higher than the information
gain.

It is more difficult to select the next attribute value based on the previous attribute value, which leads to
unconditional selection of attributes.

When same-valued attributes are used in the decision tree generation, the split is a complex task and gives
unbalanced trees.

To overcome the above limitations in C4.5, the proposed approach introduces Pearson's Correlation
Coefficient, which is used for the entropy. This solves the unconditional selection of attributes, the poor result, the lower
performance and accuracy caused by the mismatch of the attribute values, and the uncertainty in entropy.




D. Improved C4.5 Algorithm

To overcome this entropy limitation of the C4.5 algorithm, the improved C4.5 algorithm is introduced. Here the
entropy is evaluated with Pearson's Correlation Coefficient, which finds the linear relationship between two
variables by measuring its strength and direction. The coefficient ranges from -1 to +1. It attains the maximum
value of +1 for a perfect increasing linear relationship and the value -1 for a perfect decreasing linear
relationship. When there is no linear relationship it attains the value 0.

Let X and Y be two interval or ratio variables. The joint distribution of these two variables is assumed to be
bivariate normal. Pearson's Correlation Coefficient is used for evaluating the entropy. Equation 4 gives the formula
for Pearson's Correlation Coefficient.



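$$ r_{xy} = \frac{n \sum XY - \sum X \sum Y}{\sqrt{\left[n \sum X^2 - \left(\sum X\right)^2\right]\left[n \sum Y^2 - \left(\sum Y\right)^2\right]}} \tag{4} $$

or

$$ r_{xy} = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sqrt{\left(\sum X^2 - \frac{\left(\sum X\right)^2}{n}\right)\left(\sum Y^2 - \frac{\left(\sum Y\right)^2}{n}\right)}} \tag{5} $$

Where n is the number of (X, Y) score pairs, ΣX is the sum of all the X scores, ΣY is the sum of all the Y scores,
ΣX² is the sum of the squares of each X score, ΣY² is the sum of the squares of each Y score, and ΣXY is the sum of
the products obtained by multiplying each X score by its associated Y score.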
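The coefficient can be computed directly from the five sums just defined. The Python sketch below uses the standard computational form of Pearson's formula. The function name and the convention of returning 0.0 when one variable is constant (where the coefficient is undefined) are assumptions for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from the sums of X, Y, X^2, Y^2 and XY."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    # A constant variable makes the denominator zero; 0.0 is returned by convention here.
    return numerator / denominator if denominator else 0.0
```

With X = 1, 2, 3, 4 and Y = 2, 4, 6, 8 the sums are ΣX = 10, ΣY = 20, ΣX² = 30, ΣY² = 120 and ΣXY = 60, so the numerator is 4·60 − 10·20 = 40, the denominator is √(20·80) = 40, and r = +1, a perfect increasing linear relationship.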



Table 5 below shows the pseudocode for the proposed method. The algorithm used is C4.5
with the Pearson correlation coefficient, for a better detection accuracy.
Table 5. Pseudocode for proposed approach

Input: Dataset D containing tuples with a set of attributes
Output: Decision tree with highest information gain
Procedure
    Create an initial node N
    Apply entropy for each attribute in the dataset
    Calculate information gain and gain ratio for each attribute in the dataset
    Select highest information gain of attribute A (D, attribute list) to find
    the best splitting criterion
    While not end of attributes do
        If the dataset is partitioned with a single attribute A then
            Apply entropy and information gain for attribute A
            Apply gain ratio for attribute A
            Select highest information gain of attribute A to find the best
            splitting criterion
        Else
            Declare as leaf node
        End if
    End while
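The procedure in Table 5 can be sketched in Python as a recursive tree builder over categorical attributes. This is only an illustration under assumptions: flows are represented as dictionaries of already categorized attribute values, the gain ratio is used as the selection criterion, pruning is omitted, and the Pearson correlation step is not included because the paper does not state exactly where it enters the procedure. All names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting `rows` on the categorical attribute `attr`."""
    n = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or a dict describing a branch node."""
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: all flows share one class, or no attribute is left to test.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) <= 0:
        return majority  # no attribute separates the classes any further
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "children": {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node
```

Each recursive call corresponds to one pass of the loop in Table 5: the best attribute is chosen, the flows are partitioned on it, and partitions that cannot be split further are declared leaf nodes.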



In Table 5 above, the traffic flows collected are monitored and malicious flows are detected using the
C4.5 algorithm steps. To further enhance the detection accuracy, the Pearson correlation coefficient is
integrated.

Figure 2 below shows the detailed flow of the proposed process, C4.5 with the Pearson correlation coefficient, for
detecting and classifying Internet worms based on their network traffic flow characteristics.

[Figure 2 flowchart boxes: Calculate information gain and gain ratio for each attribute; Select highest information gain of attribute A (D, attribute list); Calculate entropy for attribute A; Declare as leaf node; Calculate information gain and gain ratio for attribute A; Find best splitting criterion; End]

Figure 2: Flowchart for Proposed Approach
The proposed approach monitors the illegal traffic behavior at the network level based on flow attributes
and classifies the detected Internet worms. As Figure 2 above shows, the proposed approach uses the decision
tree algorithm C4.5 with Pearson's Correlation Coefficient for detecting and classifying the existence of
Internet worms in the network.



4. Experimental Results 



The proposed system is evaluated using three parameters: precision value, recall value and
accuracy.

Precision value - Precision refers to the retrieved documents. It is calculated as the total number of
relevant documents retrieved divided by the total number of resultant documents.

$$ \text{Precision Value} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} $$

Recall value - Recall refers to the relevant documents that are related to the search request. It is
calculated as the number of relevant documents retrieved divided by the total number of relevant documents.

$$ \text{Recall Value} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} $$

Accuracy - Accuracy is the measure used for classification: the fraction of all instances that are
classified correctly.

$$ \text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}} $$
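The three measures can be computed from the four confusion-matrix counts, as in the short Python sketch below. The function name and the choice to return 0.0 when a denominator is zero are assumptions for this example, and the counts in the usage note are hypothetical, not results from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy
```

With hypothetical counts of 8 true positives, 2 false positives, 7 true negatives and 3 false negatives, precision is 8/10 = 0.8, recall is 8/11 and accuracy is 15/20 = 0.75.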

The proposed system is implemented using Java. A benchmark dataset is collected from the Internet through
the web. The dataset contains a total of 5,0[...] records, covering both malicious and
non-malicious traffic data. Compared with the existing reduced error pruning tree system, the proposed work
provides better accuracy in the detection of malicious traffic flows.



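Table 6. Comparison of existing and proposed approach performance

Parameters | Existing | Proposed C4.5 with Pearson Correlation Coefficient | % of Improvement
-----------|----------|-----------------------------------------------------|-----------------
Precision Value (%) | 67% | 81% | 21%
Recall Value (%) | 79% | 88% | 11%
Accuracy (%) | 65% | 76% | 17%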

Table 6 compares the existing reduced error pruning method and the proposed improved C4.5 algorithm
on three parameters: Precision Value, Recall Value and Accuracy, each expressed as a percentage.
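The improvement column appears to report the relative improvement over the existing method, that is, the difference divided by the existing value. This reading is an assumption, but it reproduces the 11% and 17% reported for recall and accuracy. The Python sketch below recomputes the column from the existing and proposed values in Table 6; the function name is illustrative.

```python
def relative_improvement(existing, proposed):
    """Improvement of `proposed` over `existing`, as a percentage of `existing`."""
    return (proposed - existing) / existing * 100


# Existing and proposed values (in %) taken from Table 6.
results = {"Precision": (67, 81), "Recall": (79, 88), "Accuracy": (65, 76)}

for name, (existing, proposed) in results.items():
    print(f"{name}: {relative_improvement(existing, proposed):.1f}%")
```

Rounded to whole percentages, the three values are 21% for precision, 11% for recall and 17% for accuracy.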
Figure 3. Comparison of Recall and Precision

Figure 3 above represents the precision and recall values obtained by the existing and proposed approaches.
The results illustrate that C4.5 with Pearson's Correlation Coefficient achieves better accuracy than the
existing reduced error pruning algorithm.



[Figure 4: accuracy by technique (axis label: Techniques), including the proposed C4.5 with Pearson Correlation Coefficient]

Figure 4. Comparison of Accuracy

From Figure 4 above, the overall accuracy obtained by the existing reduced error pruning is 65% and by the proposed
C4.5 algorithm is 76%. Figure 4 clearly shows that the proposed C4.5 technique
gives a better accuracy level and improved performance than the existing approach.

5. Conclusion 



Internet worms are serious and challenging threats in network and communication security. In this
paper, malicious and non-malicious payload traffic is detected based on attributes. Improved C4.5 with a
correlation coefficient improves the accuracy of detection, overcoming the limitations of existing
approaches. Based on the traffic flow characteristics, the attribute vector calculates the continuous flow of
attributes. Moreover, the proposed approach provides the decision tree with the highest information gain. The
proposed method provides better detection accuracy with high precision and recall value than the existing
method.

References

2. [...] scanning worms", ELSEVIER, Computer Networks, Vol. 54, 2010, pp. 3210-3222.

3. David Zhao, Issa Traore, Bassam Sayed, Wei Lu, Sherif Saad, Ali Ghorbani and Dan Garant, "Botnet
detection based on traffic behavior analysis and flow intervals", ELSEVIER, Computers & Security,
Vol. 39, 2013, pp. 2-16.